PREDICTION ERROR OF CROSS-VALIDATED LASSO


SOURAV CHATTERJEE AND JAFAR JAFAROV 


Abstract. In spite of the wealth of literature on the theoretical properties of the Lasso, there is 
very little known when the value of the tuning parameter is chosen using the data, even though 
this is what actually happens in practice. We give a general upper bound on the prediction error 
of Lasso when the tuning parameter is chosen using a variant of 2-fold cross-validation. No special 
assumption is made about the structure of the design matrix, and the tuning parameter is allowed 
to be optimized over an arbitrary data-dependent set of values. The proof is based on a general 
principle that may extend to other kinds of cross-validation as well as to other penalized regression 
methods. Based on this result, we propose a new estimate for error variance in high dimensional 
regression and prove that it has good properties under minimal assumptions. 


1. Introduction 

Since its introduction by Tibshirani, the Lasso has become one of the most popular tools
for high dimensional regression. Although most readers of this article will undoubtedly be familiar 
with the Lasso, let us still describe the setup for the sake of fixing notation. Consider the linear 
regression model 

$$Y = X\beta + \varepsilon, \tag{1.1}$$

where $X$ is an $n \times p$ design matrix, $\beta$ is a $p \times 1$ vector of unknown parameters, $\varepsilon$ is an $n \times 1$ vector
of i.i.d. $N(0, \sigma^2)$ random variables (where $\sigma^2$ is an unknown parameter called the ‘error variance’),
and $Y$ is the $n \times 1$ response vector. When $p$ is small and $n$ is large, ordinary least squares is an
effective tool for estimating the parameter vector $\beta$. However, many modern applications have
the characteristic that both $n$ and $p$ are large, and sometimes $p$ is much larger than $n$. The
Lasso prescribes a way of estimating $\beta$ in this scenario. There are two equivalent versions of the
Lasso, namely, the primal version and the dual version. In the primal version, the statistician
chooses a tuning parameter $K$, and produces an estimate $\hat\beta$ by minimizing $\|Y - X\beta\|$ subject to
$|\beta|_1 \le K$, where $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^n$ and $|\cdot|_1$ is the $\ell^1$ norm in $\mathbb{R}^p$. In the dual
version, the statistician chooses a tuning parameter $\lambda$, and produces an estimate $\hat\beta$ by minimizing
$\|Y - X\beta\|^2 + \lambda|\beta|_1$. Note that the optimal $\hat\beta$ may not be unique in either version. Although the
Lasso was introduced in its primal form in Tibshirani’s paper, the dual form has become more
popular due to algorithmic efficiency. There is no universally accepted prescription for choosing the
tuning parameter in either of the two versions. In practice, the tuning parameter is almost always
chosen in a data-dependent manner, often using cross-validation.
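As an illustration of the dual version, the following Python sketch minimizes $\|Y - X\beta\|^2 + \lambda|\beta|_1$ by cyclic coordinate descent. It is not taken from the paper: the function names `soft_threshold` and `lasso_dual`, the fixed number of sweeps, and the list-of-rows data layout are hypothetical choices made for this example. The only fact used is that each one-dimensional subproblem is solved by soft-thresholding at level $\lambda/2$.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_dual(X, Y, lam, sweeps=200):
    """Coordinate descent sketch for min_b ||Y - X b||^2 + lam * |b|_1.

    X is a list of n rows (each a list of p floats); Y is a list of n floats.
    The number of sweeps is an assumption of this sketch, not a tuned value.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(Y)  # equals Y - X beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of column j with the residual that ignores coordinate j.
            rho = sum(X[i][j] * residual[i] for i in range(n)) + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            shift = new_bj - beta[j]
            if shift != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * shift
                beta[j] = new_bj
    return beta
```

With an identity design the minimizer is just the soft-thresholded response, which gives a quick sanity check of the update rule.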

There is a large body of literature on the theoretical properties of the Lasso. Instead of trying to 
give a comprehensive overview, we will just highlight some essential references from this literature. 
The analysis of basis pursuit by Chen, Donoho and Saunders [14] and the papers of Donoho and
Stark [18] and Donoho and Huo provided some key ideas for subsequent authors. Knight and
Fu [30] proved consistency of $\hat\beta$ under the assumption that $p$ remains fixed and $n \to \infty$. When
both $p$ and $n$ tend to infinity, the consistency of $\hat\beta$ has no traditional definition. Greenshtein and
Ritov [24] defined a notion of consistency in this setting that they called ‘persistence’, and proved
that under a set of assumptions, the Lasso estimator is persistent. Persistence of a sequence of
estimators is defined as follows. Rewrite the linear regression model (1.1) as

$$y_i = x_i\beta + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1.2}$$

where $x_1, \ldots, x_n$ are the $n$ rows of the design matrix $X$, $\varepsilon_1, \ldots, \varepsilon_n$ are the components of the error vector
$\varepsilon$, and $y_1, \ldots, y_n$ are the components of the response vector $Y$. Suppose that the pairs $(y_i, x_i)$ are
i.i.d. draws from some probability distribution on $\mathbb{R} \times \mathbb{R}^p$. For any estimator $\hat\beta$, define

$$L_n(\hat\beta) := E(y - x\hat\beta)^2, \tag{1.3}$$

where $(y, x)$ is a pair drawn from the same distribution that is independent of $(y_1, x_1), \ldots, (y_n, x_n)$. Similarly for any
$\beta \in \mathbb{R}^p$ define $L_n(\beta) := E(y - x\beta)^2$. If we have a sequence of regression problems as above and
estimators $\hat\beta_n$, Greenshtein and Ritov [24] called the sequence $\hat\beta_n$ ‘persistent’ if

$$\lim_{n \to \infty} \bigl(L_n(\hat\beta_n) - \inf_{\beta} L_n(\beta)\bigr) = 0.$$

Persistence has become a popular notion of consistency in high dimensional problems. It is sometimes
called ‘risk consistency’. It has been the topic of investigation in several subsequent papers
on the Lasso, such as Bunea et al. and van de Geer. Quantitative bounds were given
in Zhang [52], Rigollet and Tsybakov, Bühlmann and van de Geer [8], Bartlett et al. [3] and
Chatterjee.

Another kind of consistency that has been investigated in the context of Lasso is model selection 
consistency. The Lasso estimator has the property that often, most of the coordinates of $\hat\beta$ turn
out to be equal to zero. The nonzero coordinates therefore do an automatic ‘model selection’.
Consistency of model selection by Lasso under a variety of assumptions on the design matrix and
the sparsity of $\beta$ was investigated by Zou, Donoho et al., Wainwright [49], Meinshausen
and Bühlmann [34], Meinshausen and Yu [35], Bickel et al. [5], Massart and Meynet [33] and Zhao
and Yu [53].

In all of the above papers, the tuning parameter is considered to be deterministically chosen. 
For example, Greenshtein and Ritov [24] showed that if the tuning parameter $K$ in the primal form
of the Lasso grows like $o((n/\log n)^{1/4})$, then the Lasso estimator is persistent. This is the general
flavor of subsequent results, such as those in the papers cited above.

However, as mentioned before, this is usually not how the tuning parameter is chosen in practice. 
Systematic ways of deterministically choosing the tuning parameter have been proposed by Wang 
and Leng [50], Zou et al. [56] and Tibshirani and Taylor [46]. Other authors, such as Tibshirani [45],
Greenshtein and Ritov [24], Hastie et al. [25], Efron et al. [19], Zou et al. [56], van de Geer
and Lederer [38], Fan et al. and Friedman [22] recommend using cross-validation to select the
value of the tuning parameter. In practice, the tuning parameter is almost always chosen using 
some data-dependent method, often cross-validation. 

In view of the above, it is surprising that there are very few rigorous results about the Lasso 
when the tuning parameter is chosen through cross-validation. In fact, the only results we are 
aware of are from some recent papers of Lecué and Mitchell and Homrighausen and McDonald
[26, 27, 28]. The paper of Lecué and Mitchell aims to build a general theory of cross-validation in a variety of
problems; unfortunately, it seems that for cross-validation in Lasso, it requires that the vector $x$
of explanatory variables is a random vector with a log-concave distribution. This is possibly too
strong an assumption to be practically useful. The papers [26, 27, 28] have more relaxed assumptions.
Roughly speaking, the main result of Homrighausen and McDonald goes as follows. Consider the primal form of
Lasso, and suppose that the tuning parameter $K$ is chosen from an interval $[0, K_n]$ using $r$-fold
cross-validation, where $r$ is a fixed number and $K_n$ is determined according to a formula given in
their work. Let $\hat K$ be the optimized value of $K$, and let $\hat\beta_{\hat K}$ be the Lasso estimate of $\beta$ for this value of
the tuning parameter. Suppose that $(y_i, x_i)$ are i.i.d. from some distribution and let $L_n(\hat\beta_{\hat K})$
be defined as in (1.3). Additionally suppose that $p$ is growing like $n^\alpha$ for some positive $\alpha$, and that
the minimum nonzero singular value of the design matrix $X$ is, with high probability, of order
or bigger. Under these conditions and a few other assumptions, the main result states that,
as $n \to \infty$.


Although this is laudable as the first mathematical result about any kind of consistency of cross- 
validated lasso, there are a number of unsatisfactory aspects of this result. For the theorem to be 
effective, we need 


Kn = {log n) , (1.4) 

which is too slow for all practical purposes. Usually, the tuning parameter is optimized over a 
fairly large range, determined by the statistician using some ad hoc data-dependent rule. The 
second issue is that the value of $K_n$ is prescribed by a formula given by the authors, which may
not end up satisfying (1.4). Lastly, the condition on the smallest nonzero singular value of the
design matrix looks a bit restrictive, especially since it has been observed in several recent papers
that risk consistency in Lasso (with deterministic value of the tuning parameter) holds without any
conditions on the design matrix.

The goal of this paper is to address these issues and prove a new and better upper bound on the 
prediction error of cross-validated Lasso under fewer assumptions. The main result is presented in 
the next section. An application to error variance estimation is worked out in Section 3.

Incidentally, there is a substantial body of literature on cross-validation for ridge regression, the 
classical cousin of Lasso. A representative paper from this literature, for example, is the highly 
cited article of Golub, Heath and Wahba [23]. The techniques and results of these papers depend 
heavily on the friendly mathematical structure of ridge regression. They do not seem to generalize 
to other settings in any obvious way. 

Similarly, classical results on cross-validation such as those of Stone apply only to
problems where $p$ is fixed, and are therefore not relevant in our setting.

It is important to point out that the Lasso is not the only technique for high dimensional 
regression under sparsity assumptions. Numerous methods have been proposed in the last twenty 
years. Many of them, like the Lasso, are based on the idea of performing regression with a penalty 
term. These include basis pursuit [13], SCAD [21], LARS [19], elastic net [55] and the Dantzig
selector. The Lasso itself has been refined over the years, yielding variants such as the
group Lasso, the adaptive Lasso [54] and the square-root Lasso [1]. Penalized regression is
not the only approach; for example, methods of model selection by testing hypotheses have also been
proposed. Most of these methods involve some sort of a tuning parameter, which
is often chosen using cross-validation. The techniques of this paper may be helpful in analyzing 
cross-validation in a variety of such instances. Our reason for focusing on the Lasso is simply to 
choose one test case where the proof technique may be implemented, and the Lasso seemed like a 
natural choice because of its popularity among practitioners. 
2. Main result


Consider the linear regression model (1.1). Let $x_i$ be the $i$th row of $X$, $y_i$ be the $i$th component
of $Y$ and $\varepsilon_i$ be the $i$th component of $\varepsilon$, so that (1.2) holds. In our setup (unlike in the persistence
setting of Section 1), the row vectors $x_1, \ldots, x_n$ are non-random elements of $\mathbb{R}^p$.

The mean squared prediction error of an estimator $\hat\beta$ is defined as

$$\mathrm{MSPE}(\hat\beta) := E\biggl(\frac{\|X\beta^* - X\hat\beta\|^2}{n}\biggr),$$

where $\beta^*$ is the true value of $\beta$.
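For a single fitted $\hat\beta$, the quantity inside the expectation can be computed directly when $\beta^*$ is known, as in a simulation. The short Python sketch below does this; the function name `prediction_loss` and the list-of-rows layout are hypothetical, and averaging its output over many simulated data sets would only approximate the MSPE.

```python
def prediction_loss(X, beta_star, beta_hat):
    """Return ||X beta_star - X beta_hat||^2 / n for one realization of beta_hat."""
    n = len(X)
    total = 0.0
    for row in X:
        # Difference of the two fitted values for this row of the design matrix.
        diff = sum(x * (bs - bh) for x, bs, bh in zip(row, beta_star, beta_hat))
        total += diff * diff
    return total / n
```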

Our goal is to produce an estimate of /3 using the Lasso procedure and optimizing the tuning 
parameter by a certain variant of 2-fold cross-validation, and then give an upper bound for its mean 
squared prediction error. The specific algorithm that we are proposing is the following. 

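1. Divide the set $\{1, \ldots, n\}$ randomly into two parts $I$ and $I^c$, by independently putting each
element into either $I$ or $I^c$ with equal probability.

2. For each $K \ge 0$, let $\hat\beta^{(K,1)}$ be a minimizer of

$$\sum_{i \in I} (y_i - x_i\beta)^2$$

subject to $|\beta|_1 \le K$ and let $\hat\beta^{(K,2)}$ be a minimizer of

$$\sum_{i \in I^c} (y_i - x_i\beta)^2$$

subject to $|\beta|_1 \le K$. If there are multiple minima, choose one according to some deterministic
rule. In the rare event that $I$ or $I^c$ is empty, define the corresponding $\hat\beta$’s to be zero.

3. Let $N_1$ and $N_2$ be two nonnegative integer-valued random variables, where $N_1$ is a function of
$(y_i, x_i)_{i \in I^c}$ and $N_2$ is a function of $(y_i, x_i)_{i \in I}$. These numbers will determine the range over which
the tuning parameter is optimized in the next step. The choice of $N_1$ and $N_2$ is left to the user.

4. Let $\delta$ be a positive real number, to be chosen by the user. Let $\hat K_1$ be a minimizer of

$$\sum_{i \in I} (y_i - x_i\hat\beta^{(K,2)})^2$$

as $K$ ranges over the set $\{0, \delta, 2\delta, \ldots, N_1\delta\}$. Let $\hat K_2$ be a minimizer of

$$\sum_{i \in I^c} (y_i - x_i\hat\beta^{(K,1)})^2$$

as $K$ ranges over the set $\{0, \delta, 2\delta, \ldots, N_2\delta\}$.

5. Traditional cross-validation produces a single optimized $K$ instead of $\hat K_1$ and $\hat K_2$ as we did
above. In this step, we will combine $\hat K_1$ and $\hat K_2$ to produce a single $\hat K$, as follows. Define a
vector $\hat\mu' \in \mathbb{R}^n$ as

$$\hat\mu'_i := \begin{cases} x_i\hat\beta^{(\hat K_1, 2)} & \text{if } i \in I, \\ x_i\hat\beta^{(\hat K_2, 1)} & \text{if } i \in I^c. \end{cases}$$

For each $K$, let $\hat\beta^{(K)}$ be a minimizer of $\|Y - X\beta\|$ subject to $|\beta|_1 \le K$. Let $\hat K$ be a minimizer
of $\|\hat\mu' - X\hat\beta^{(K)}\|$ over $K \ge 0$.

6. Finally, define the cross-validated estimate $\hat\beta^{CV} := \hat\beta^{(\hat K)}$.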
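The six steps above can be sketched in Python as follows. This is a rough illustration under several assumptions that are not part of the procedure itself: the primal Lasso is solved approximately by projected gradient descent (our substitute solver, with a hypothetical iteration count), $N_1$ and $N_2$ are passed in as fixed integers rather than computed from the data, and the minimization over all $K \ge 0$ in Step 5 is replaced by a search over the finite grid $\{0, \delta, \ldots, \max(N_1, N_2)\delta\}$. All function names are hypothetical.

```python
import math
import random


def dot(row, beta):
    return sum(x * b for x, b in zip(row, beta))


def project_l1(v, radius):
    """Euclidean projection of v onto the l1 ball {b : |b|_1 <= radius}."""
    if radius <= 0.0:
        return [0.0] * len(v)
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    theta, running = 0.0, 0.0
    for k, u in enumerate(sorted((abs(x) for x in v), reverse=True), start=1):
        running += u
        candidate = (running - radius) / k
        if u > candidate:
            theta = candidate  # the last k passing this test gives the threshold
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


def primal_lasso(X, Y, K, p, iters=300):
    """Projected gradient sketch for min ||Y - X b||^2 subject to |b|_1 <= K.

    Returns the zero vector when there are no rows, as prescribed in Step 2.
    """
    beta = [0.0] * p
    frob_sq = sum(x * x for row in X for x in row)
    if frob_sq == 0.0 or K <= 0.0:
        return beta
    step = 1.0 / frob_sq  # frob_sq bounds the largest eigenvalue of X'X
    for _ in range(iters):
        resid = [y - dot(row, beta) for row, y in zip(X, Y)]
        grad = [-sum(row[j] * r for row, r in zip(X, resid)) for j in range(p)]
        beta = project_l1([b - step * g for b, g in zip(beta, grad)], K)
    return beta


def tune(train_X, train_Y, test_X, test_Y, grid, p):
    """Fit on one half for each K in grid; keep the fit with smallest error on the other half."""
    best_err, best_beta = None, None
    for K in grid:
        beta = primal_lasso(train_X, train_Y, K, p)
        err = sum((y - dot(row, beta)) ** 2 for row, y in zip(test_X, test_Y))
        if best_err is None or err < best_err:  # ties go to the smallest K
            best_err, best_beta = err, beta
    return best_beta


def cv_lasso(X, Y, delta, N1, N2, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    in_I = [rng.random() < 0.5 for _ in range(n)]            # Step 1
    XI = [X[i] for i in range(n) if in_I[i]]
    YI = [Y[i] for i in range(n) if in_I[i]]
    Xc = [X[i] for i in range(n) if not in_I[i]]
    Yc = [Y[i] for i in range(n) if not in_I[i]]
    grid1 = [k * delta for k in range(N1 + 1)]               # Step 4 grids
    grid2 = [k * delta for k in range(N2 + 1)]
    beta_1 = tune(Xc, Yc, XI, YI, grid1, p)  # fitted on I^c, tuned on I
    beta_2 = tune(XI, YI, Xc, Yc, grid2, p)  # fitted on I, tuned on I^c
    mu = [dot(X[i], beta_1 if in_I[i] else beta_2) for i in range(n)]  # Step 5
    best = None
    for k in range(max(N1, N2) + 1):  # ASSUMPTION: finite grid instead of all K >= 0
        beta = primal_lasso(X, Y, k * delta, p)
        gap = sum((m - dot(row, beta)) ** 2 for m, row in zip(mu, X))
        if best is None or gap < best[0]:
            best = (gap, k * delta, beta)
    return best[1], best[2]                                  # Step 6
```

A call such as `cv_lasso(X, Y, delta=0.5, N1=8, N2=8)` returns the selected value of $K$ on the grid together with the corresponding fitted coefficient vector. The solver and the grid in Step 5 are the two places where this sketch departs from the exact definitions, so its output should be read as an approximation to $\hat\beta^{CV}$.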
The following theorem gives an upper bound on the mean squared prediction error of $\hat\beta^{CV}$ under
the condition that $N_1\delta$ and $N_2\delta$ both exceed $|\beta^*|_1$. This is the main result of this paper. In the
statement of the theorem, we use the standard convention that for a random variable $X$ and an
event $A$, $E(X; A)$ denotes the expectation of the random variable $X 1_A$, where $1_A = 1$ if $A$ happens
and $0$ otherwise.


Theorem 2.1. Consider the linear regression model (1.1). Let $\hat\beta^{CV}$ be defined as above and $\beta^*$ be
the true value of $\beta$. Let $L := |\beta^*|_1 + \delta$, and let

$$M := \max_{1 \le j \le p} \frac{1}{n}\sum_{i=1}^n x_{ij}^4, \qquad l_1 := E\log(N_1 + 1), \qquad l_2 := E\log(N_2 + 1).$$


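Then

$$E\biggl(\frac{\|X\beta^* - X\hat\beta^{CV}\|^2}{n};\ N_1\delta \ge |\beta^*|_1,\ N_2\delta \ge |\beta^*|_1\biggr) \le C_1\biggl(\sqrt{\frac{l_1}{n}} + \sqrt{\frac{l_2}{n}}\biggr) + C_2\sqrt{\frac{\log(2p)}{n}} + E_n,$$
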


where 


Cl = 16(4o-^ + , 

C2 = , 

and En is the exponentially small term 


16 




Remarks 

1. To understand what the theorem is saying, think of $L$, $M$ and $\sigma$ as fixed numbers that are not
growing with $n$. Also, think of $\delta$ as tending to zero (or at least remaining bounded), so that
$L$ is basically the same as $|\beta^*|_1$. Then the theorem says that as long as $N_1\delta$, $N_2\delta$ and $p$ tend
to infinity slower than exponentially with $n$, the mean squared prediction error tends to zero as
$n \to \infty$. In fact, the prediction error goes to zero even if $|\beta^*|_1$ is allowed to grow, as long as it
grows slower than $N_1\delta$, $N_2\delta$, and $n^{1/4}(\log p)^{-1/4}$. The last criterion is a familiar occurrence in
papers on the consistency of Lasso, as mentioned before.

2. There are five significant advances that Theorem 2.1 makes over existing results on cross-
validated Lasso [26, 27, 28]:

(i) The range of values over which the tuning parameter is optimized is allowed to grow 
exponentially in n. 

(ii) The range of optimization is allowed to be arbitrarily data-dependent. 

(iii) The error bound depends on the $\ell^1$ norm of the true $\beta$, and not on the range of values over
which the tuning parameter is optimized (except through a logarithmic factor).

(iv) The theorem imposes essentially no condition on the design matrix. 

(v) The theorem gives a concrete error bound on the prediction error instead of an asymptotic 
persistence proof. 

3. Theorem 2.1 is silent on how to choose $N_1$, $N_2$ and $\delta$; the choice is left to the practitioner.
Clearly, for the upper bound to be meaningful, it is necessary that with high probability both
$N_1\delta$ and $N_2\delta$ exceed $|\beta^*|_1$. This is not surprising, because if the range of values over which the
tuning parameter is optimized does not contain the $\ell^1$ norm of the true $\beta$, then primal Lasso is
unlikely to perform well. It is not clear whether one can produce a choice of the range that has
a theoretical guarantee of success.

4. It would be nice to have a similar result for the versions of cross-validation that are actually
used in practice. The main reason why we work with our version is that it is mathematically
tractable. It is possible that a variant of the techniques developed in this paper, together with
some new ideas, may lead to a definitive result about traditional cross-validation some day.

The next section contains an application of Theorem 2.1 to error variance estimation. Theorem 2.1
is proved in Section 4 and the results of Section 3 are proved in Section 5.

3. Application to error variance estimation 

Error variance estimation is the problem of estimating $\sigma^2$ in the linear regression model (1.1).
In the high dimensional case, this problem has gained some prominence in recent times, partly
because of the emergence of literature on significance tests for Lasso [29, 32]. Significance tests
almost always require some plug-in estimate of $\sigma^2$. There are a number of proposed methods for
error variance estimation in Lasso [15, 20, 38, 41, 42, 43]. The recent paper of Reid, Tibshirani and
Friedman [36] gives a comprehensive survey of these techniques and compares their strengths and
weaknesses through extensive simulation studies. They summarize their findings as follows (italics
and quotation marks added):

“Despite some comforting asymptotic results, finite sample performance of these
estimators seems to suffer, particularly when signals become large and non sparse.
Variance estimators based on residual sums of squares with adaptively chosen regularization
parameters seem to have promising finite sample properties. In particular,
we recommend the cross-validation based Lasso residual sum of squares estimator
as a good variance estimator under a broad range of sparsity and signal strength
assumptions. The complexity of their structure seems to have discouraged their rigorous
analysis.”

Reid, Tibshirani and Friedman [36] observe that it is possible to construct an error variance estimator
using the cross-validation procedure of Homrighausen and McDonald, but it is consistent
only when $s/n \to 0$, where $s$ is the number of nonzero entries in the cross-validated Lasso estimate $\hat\beta$.
Such a result does not follow from any known theorem in the literature.

In an attempt to address the above problems, we propose a new estimate of error variance based
on our cross-validated Lasso estimate. Let $\hat\beta^{CV}$ be the cross-validated estimate of $\beta$ defined in
Section 2. Define

$$\hat\sigma^2 := \frac{\|Y - X\hat\beta^{CV}\|^2}{n}.$$

The following theorem gives an upper bound on the mean absolute error of $\hat\sigma^2$. It gives a theoretical
proof that $\hat\sigma^2$ is a good estimator of $\sigma^2$ under the same mild conditions as in Theorem 2.1. Whether
it will actually perform well in practice is a different question that is not addressed in this paper.
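Computing this estimate from a fitted coefficient vector is a one-line residual calculation. The Python sketch below shows it; the function name `error_variance_estimate` is hypothetical, and the argument `beta_cv` is assumed to be whatever approximation of the cross-validated estimate is available. Note the divisor $n$, with no degrees-of-freedom correction, as in the definition above.

```python
def error_variance_estimate(X, Y, beta_cv):
    """Return ||Y - X beta_cv||^2 / n, the residual-based estimate of sigma^2."""
    n = len(Y)
    rss = 0.0
    for row, y in zip(X, Y):
        fitted = sum(x * b for x, b in zip(row, beta_cv))
        rss += (y - fitted) ** 2
    return rss / n
```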

Theorem 3.1. Let all notation be as in Theorem 2.1 and let $\hat\sigma^2$ be defined as above. Let $R$ denote
the right-hand side of the inequality in the statement of Theorem 2.1. Then

$$E\bigl(|\hat\sigma^2 - \sigma^2|;\ N_1\delta \ge |\beta^*|_1,\ N_2\delta \ge |\beta^*|_1\bigr) \le \sigma^2\sqrt{\frac{2}{n}} + 2\sigma\sqrt{R} + R.$$

The following corollary demonstrates a simple scenario under which $\hat\sigma^2$ is a consistent estimate
of $\sigma^2$. Note that Theorem 3.1 is more general than this illustrative corollary.

Corollary 3.2. Consider a sequence of regression problems and estimates of the type analyzed in
Theorems 2.1 and 3.1. Suppose that $\sigma^2$ is the same in each problem, but $n \to \infty$ and all other
quantities change with $n$. Assume that:

(i) The entries of the design matrix and the $\ell^1$ norm of the true parameter vector $\beta^*$ remain
uniformly bounded as $n \to \infty$.

(ii) $N_1\delta$ and $N_2\delta$ tend to infinity in probability, but $\delta$ remains bounded.

(iii) $\log N_1$, $\log N_2$ and $\log p$ grow at most like $o(n)$.

Then $\hat\sigma^2$ is a consistent estimate of $\sigma^2$ in this sequence of problems.

Theorem 3.1 and Corollary 3.2 are proved in Section 5. In the next section, we prove Theorem 2.1.


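4. Proof of Theorem 2.1


In the statement of the theorem, $L$ is defined as $|\beta^*|_1 + \delta$. However in this proof, we will redefine
$L$ as the smallest integer multiple of $\delta$ that is $\ge |\beta^*|_1$. It suffices to prove the theorem with this
new $L$, because the old $L$ is $\ge$ the new one.

Let $\mu^* := X\beta^*$ and $\hat\mu := X\hat\beta^{CV}$. We will first work on bounding $\|\hat\mu' - \mu^*\|$ instead of $\|\hat\mu - \mu^*\|$.

If $I^c$ is nonempty and $N_1\delta \ge |\beta^*|_1$, then by definition of $\hat K_1$,

$$\sum_{i \in I}(y_i - x_i\hat\beta^{(\hat K_1, 2)})^2 \le \sum_{i \in I}(y_i - x_i\hat\beta^{(L, 2)})^2.$$

If $I^c$ or $I$ is empty, then equality holds, so the above inequality is true anyway. Adding and
subtracting $x_i\beta^*$ inside the square on both sides gives

$$\sum_{i \in I}\bigl(\varepsilon_i^2 + 2\varepsilon_i(x_i\beta^* - x_i\hat\beta^{(\hat K_1, 2)}) + (x_i\beta^* - x_i\hat\beta^{(\hat K_1, 2)})^2\bigr)
\le \sum_{i \in I}\bigl(\varepsilon_i^2 + 2\varepsilon_i(x_i\beta^* - x_i\hat\beta^{(L, 2)}) + (x_i\beta^* - x_i\hat\beta^{(L, 2)})^2\bigr).$$

This can be rewritten as

$$\sum_{i \in I}(\mu^*_i - \hat\mu'_i)^2 = \sum_{i \in I}(x_i\beta^* - x_i\hat\beta^{(\hat K_1, 2)})^2
\le 2\sum_{i \in I}\varepsilon_i(x_i\hat\beta^{(\hat K_1, 2)} - x_i\hat\beta^{(L, 2)}) + \sum_{i \in I}(x_i\beta^* - x_i\hat\beta^{(L, 2)})^2. \tag{4.1}$$

A similar expression may be obtained for $\sum_{i \in I^c}(\mu^*_i - \hat\mu'_i)^2$. Since these two quantities have the
same unconditional distribution and their sum is the total prediction error $\|\mu^* - \hat\mu'\|^2$, it suffices to
obtain a bound on the expectation of one of them. We start by bounding the expectation of the second
term in (4.1). Throughout the remainder of the proof, let $E'$ denote conditional expectation given
$I$ and $E''$ denote the conditional expectation given $I$ and $(y_i, x_i)_{i \in I^c}$.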


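Lemma 4.1. For each $\beta \in \mathbb{R}^p$, let

$$\varphi(\beta) := \sum_{i \in I}(x_i\beta^* - x_i\beta)^2 - \sum_{i \in I^c}(x_i\beta^* - x_i\beta)^2.$$

Then

$$E\Bigl(\sup_{\beta \in \mathbb{R}^p:\, |\beta|_1 \le L} \varphi(\beta)\Bigr) \le 3L^2(2Mn\log(2p^2))^{1/2}.$$


Proof. Note that we can write

$$\varphi(\beta) = \sum_{i=1}^n \eta_i(x_i\beta^* - x_i\beta)^2,$$

where

$$\eta_i = \begin{cases} 1 & \text{if } i \in I, \\ -1 & \text{if } i \in I^c. \end{cases}$$

Note that $\eta_1, \ldots, \eta_n$ are i.i.d. random variables with mean zero. Write

$$\sum_{i=1}^n \eta_i(x_i\beta^* - x_i\beta)^2 = \sum_{i=1}^n \eta_i(x_i\beta^*)^2 - 2\sum_{i=1}^n \eta_i(x_i\beta^*)(x_i\beta) + \sum_{i=1}^n \eta_i(x_i\beta)^2.$$

First, observe that

$$E\Bigl(\sum_{i=1}^n \eta_i(x_i\beta^*)^2\Bigr) = 0. \tag{4.2}$$

Next, note that for any $\beta$ with $|\beta|_1 \le L$,

$$-2\sum_{i=1}^n \eta_i(x_i\beta^*)(x_i\beta) = -2\sum_{j=1}^p \beta_j\Bigl(\sum_{i=1}^n \eta_i(x_i\beta^*)x_{ij}\Bigr)
\le 2L\max_{1 \le j \le p}\Bigl|\sum_{i=1}^n \eta_i(x_i\beta^*)x_{ij}\Bigr|. \tag{4.3}$$

Lemma A.1 from the Appendix implies that for any $\theta \in \mathbb{R}$,

$$E\Bigl(\exp\Bigl(\theta\sum_{i=1}^n \eta_i(x_i\beta^*)x_{ij}\Bigr)\Bigr) \le \exp\Bigl(\frac{\theta^2}{2}\sum_{i=1}^n ((x_i\beta^*)x_{ij})^2\Bigr).$$

So by Lemma A.2 of the Appendix,

$$E\Bigl(\max_{1 \le j \le p}\Bigl|\sum_{i=1}^n \eta_i(x_i\beta^*)x_{ij}\Bigr|\Bigr) \le \Bigl(2\log(2p)\max_{1 \le j \le p}\sum_{i=1}^n ((x_i\beta^*)x_{ij})^2\Bigr)^{1/2}. \tag{4.4}$$

Note that for any $j$,

$$\sum_{i=1}^n ((x_i\beta^*)x_{ij})^2 = \sum_{i=1}^n \Bigl(\sum_{k=1}^p x_{ik}x_{ij}\beta^*_k\Bigr)^2
= \sum_{i=1}^n \sum_{1 \le k, l \le p} x_{ik}x_{il}x_{ij}^2\beta^*_k\beta^*_l
\le |\beta^*|_1^2 \max_{1 \le k, l \le p}\Bigl|\sum_{i=1}^n x_{ik}x_{il}x_{ij}^2\Bigr|
\le nM|\beta^*|_1^2.$$

Therefore by (4.4),

$$E\Bigl(\max_{1 \le j \le p}\Bigl|\sum_{i=1}^n \eta_i(x_i\beta^*)x_{ij}\Bigr|\Bigr) \le (2Mn|\beta^*|_1^2\log(2p))^{1/2}.$$

Using this information in (4.3), and recalling that $|\beta^*|_1 \le L$, we get

$$E\Bigl(\sup_{\beta \in \mathbb{R}^p:\, |\beta|_1 \le L} -2\sum_{i=1}^n \eta_i(x_i\beta^*)(x_i\beta)\Bigr) \le 2L^2(2Mn\log(2p))^{1/2}. \tag{4.5}$$

Similarly, for any $\beta$ with $|\beta|_1 \le L$,

$$\sum_{i=1}^n \eta_i(x_i\beta)^2 = \sum_{j,k=1}^p \beta_j\beta_k\Bigl(\sum_{i=1}^n \eta_i x_{ij}x_{ik}\Bigr)
\le L^2\max_{1 \le j, k \le p}\Bigl|\sum_{i=1}^n \eta_i x_{ij}x_{ik}\Bigr|. \tag{4.6}$$

By Lemma A.1 from the Appendix, we have that for any $\theta \in \mathbb{R}$,

$$E\Bigl(\exp\Bigl(\theta\sum_{i=1}^n \eta_i x_{ij}x_{ik}\Bigr)\Bigr) \le \exp\Bigl(\frac{\theta^2}{2}\sum_{i=1}^n x_{ij}^2x_{ik}^2\Bigr).$$

Therefore by Lemma A.2 of the Appendix,

$$E\Bigl(\max_{1 \le j, k \le p}\Bigl|\sum_{i=1}^n \eta_i x_{ij}x_{ik}\Bigr|\Bigr)
\le \Bigl(2\log(2p^2)\max_{1 \le j, k \le p}\sum_{i=1}^n (x_{ij}x_{ik})^2\Bigr)^{1/2}
\le (2Mn\log(2p^2))^{1/2}.$$

Therefore by (4.6),

$$E\Bigl(\sup_{\beta \in \mathbb{R}^p:\, |\beta|_1 \le L}\sum_{i=1}^n \eta_i(x_i\beta)^2\Bigr) \le L^2(2Mn\log(2p^2))^{1/2}. \tag{4.7}$$

By combining (4.2), (4.5) and (4.7), and using $\log(2p) \le \log(2p^2)$, we get the desired result. □

Lemma 4.2.

$$E\Bigl(\sum_{i \in I^c}(x_i\beta^* - x_i\hat\beta^{(L, 2)})^2\Bigr) \le 2L\sigma(2M^{1/2}n\log(2p))^{1/2}.$$


Proof. The inequality is trivially true if $I^c$ is empty. So let us assume that $I^c$ is nonempty. By
definition of $\hat\beta^{(L, 2)}$,

$$\sum_{i \in I^c}(y_i - x_i\hat\beta^{(L, 2)})^2 \le \sum_{i \in I^c}(y_i - x_i\beta^*)^2.$$

Adding and subtracting $x_i\beta^*$ inside the bracket on the left, this becomes

$$\sum_{i \in I^c}\bigl(\varepsilon_i^2 + 2\varepsilon_i(x_i\beta^* - x_i\hat\beta^{(L, 2)}) + (x_i\beta^* - x_i\hat\beta^{(L, 2)})^2\bigr) \le \sum_{i \in I^c}\varepsilon_i^2,$$

which is the same as

$$\sum_{i \in I^c}(x_i\beta^* - x_i\hat\beta^{(L, 2)})^2 \le 2\sum_{i \in I^c}\varepsilon_i(x_i\hat\beta^{(L, 2)} - x_i\beta^*).$$

Since $E'(\varepsilon_i) = 0$ for all $i$ and $|\hat\beta^{(L, 2)}|_1 \le L$, this gives

$$E'\Bigl(\sum_{i \in I^c}(x_i\beta^* - x_i\hat\beta^{(L, 2)})^2\Bigr) \le 2E'\Bigl(\sup_{\beta \in \mathbb{R}^p:\, |\beta|_1 \le L}\sum_{i \in I^c}\varepsilon_i x_i\beta\Bigr). \tag{4.8}$$

For any $\beta$ such that $|\beta|_1 \le L$,

$$\sum_{i \in I^c}\varepsilon_i x_i\beta = \sum_{i \in I^c}\sum_{j=1}^p \varepsilon_i x_{ij}\beta_j \le L\max_{1 \le j \le p}\Bigl|\sum_{i \in I^c}\varepsilon_i x_{ij}\Bigr|. \tag{4.9}$$

By Lemma A.2 from the Appendix,

$$E'\Bigl(\max_{1 \le j \le p}\Bigl|\sum_{i \in I^c}\varepsilon_i x_{ij}\Bigr|\Bigr)
\le \sigma\Bigl(2\log(2p)\max_{1 \le j \le p}\sum_{i \in I^c}x_{ij}^2\Bigr)^{1/2}
\le \sigma(2M^{1/2}n\log(2p))^{1/2}. \tag{4.10}$$

The desired result is obtained by combining (4.8), (4.9) and (4.10), and taking unconditional
expectation on both sides. □

Lemma 4.3.

$$E\Bigl(\sum_{i \in I}(x_i\beta^* - x_i\hat\beta^{(L, 2)})^2\Bigr) \le 3L^2(2Mn\log(2p^2))^{1/2} + 2L\sigma(2M^{1/2}n\log(2p))^{1/2}.$$

Proof. Since $|\beta^*|_1$ and $|\hat\beta^{(L, 2)}|_1$ are both bounded by $L$, therefore

$$\sum_{i \in I}(x_i\beta^* - x_i\hat\beta^{(L, 2)})^2 \le \sum_{i \in I^c}(x_i\beta^* - x_i\hat\beta^{(L, 2)})^2 + \sup_{\beta \in \mathbb{R}^p:\, |\beta|_1 \le L}\varphi(\beta).$$

Using Lemma 4.1 to bound the supremum on the right, and Lemma 4.2 to bound the sum over $I^c$, we
get the desired result. □

Next we bound the first term in (4.1).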

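Lemma 4.4.

$$E\Bigl(\sum_{i \in I}\varepsilon_i(x_i\hat\beta^{(\hat K_1, 2)} - x_i\hat\beta^{(L, 2)})\Bigr)
\le (16\sigma^4 n + 8L^2M^{1/2}\sigma^2 n)^{1/2}\sqrt{l_1}
+ (n(n+5)\sigma^4 + n(n+1)\sigma^2L^2M^{1/2})^{1/2}\Bigl(\frac{1 + 2^{-1/2}}{2}\Bigr)^{n/2}.$$

Proof. For each $K$, let

$$W(K) := \sum_{i \in I}\varepsilon_i x_i\hat\beta^{(K, 2)}.$$

Note that by conditional independence of $(\varepsilon_i)_{i \in I}$ and $\hat\beta^{(L, 2)}$ given $I$, we know that the conditional
distribution of $W(L)$ given $I$ and $(\varepsilon_i)_{i \in I^c}$ is normal with mean zero and variance $\sigma^2\sum_{i \in I}(x_i\hat\beta^{(L, 2)})^2$.
In particular,

$$E(W(L)) = E(E''(W(L))) = 0.$$

(Note that this holds true even if $I$ or $I^c$ is empty.) Hence, to prove the lemma, it is enough to
show that the right-hand side is an upper bound for the expectation of $W(\hat K_1)$.

Let $\mathcal{K} := \{0, \delta, 2\delta, \ldots, N_1\delta\}$. Then

$$W(\hat K_1) \le \max_{K \in \mathcal{K}} W(K). \tag{4.11}$$

We know that $\hat K_1$ minimizes $\sum_{i \in I}(y_i - x_i\hat\beta^{(K, 2)})^2$ among all $K \in \mathcal{K}$. Therefore, in particular,

$$\sum_{i \in I}(y_i - x_i\hat\beta^{(\hat K_1, 2)})^2 \le \sum_{i \in I}(y_i - x_i\hat\beta^{(0, 2)})^2 = \sum_{i \in I}y_i^2,$$

since $\hat\beta^{(0, 2)} = 0$. This implies that

$$\Bigl(\sum_{i \in I}(x_i\hat\beta^{(\hat K_1, 2)})^2\Bigr)^{1/2} \le \Bigl(\sum_{i \in I}y_i^2\Bigr)^{1/2} + \Bigl(\sum_{i \in I}(y_i - x_i\hat\beta^{(\hat K_1, 2)})^2\Bigr)^{1/2} \le 2\Bigl(\sum_{i \in I}y_i^2\Bigr)^{1/2}. \tag{4.12}$$

Consequently,

$$W(\hat K_1) \le \Bigl(\sum_{i \in I}\varepsilon_i^2\Bigr)^{1/2}\Bigl(\sum_{i \in I}(x_i\hat\beta^{(\hat K_1, 2)})^2\Bigr)^{1/2} \le 2\Bigl(\sum_{i \in I}\varepsilon_i^2\sum_{i \in I}y_i^2\Bigr)^{1/2}. \tag{4.13}$$

Let $m := 16|I|\sigma^2 + 8\sum_{i \in I}(x_i\beta^*)^2$, and

$$\mathcal{K}' := \Bigl\{K \in \mathcal{K} : \sum_{i \in I}(x_i\hat\beta^{(K, 2)})^2 \le m\Bigr\}.$$

Then by (4.11), at least one of the two following inequalities must hold:

$$W(\hat K_1) \le \max_{K \in \mathcal{K}'} W(K), \qquad \sum_{i \in I}(x_i\hat\beta^{(\hat K_1, 2)})^2 > m.$$

In the latter case, (4.12) implies that

$$4\sum_{i \in I}y_i^2 > m,$$

and (4.13) implies that

$$W(\hat K_1) \le 2\Bigl(\sum_{i \in I}\varepsilon_i^2\sum_{i \in I}y_i^2\Bigr)^{1/2}.$$

Thus,

$$W(\hat K_1) \le \max_{K \in \mathcal{K}'} W(K) + 2\Bigl(\sum_{i \in I}\varepsilon_i^2\sum_{i \in I}y_i^2\Bigr)^{1/2} 1_{\{4\sum_{i \in I}y_i^2 > m\}}, \tag{4.14}$$

where $1_A$ is our notation for the indicator of an event $A$.

If we condition on $I$ and $(\varepsilon_i)_{i \in I^c}$ then $m$ and $\mathcal{K}'$ become non-random but $(\varepsilon_i)_{i \in I}$ are
still i.i.d. $N(0, \sigma^2)$ random variables. Thus, for each $K \in \mathcal{K}'$, $W(K)$ is conditionally a Gaussian
random variable with mean zero and variance bounded by $m\sigma^2$. Therefore by Lemma A.2 of the
Appendix,

$$E''\bigl(\max_{K \in \mathcal{K}'} W(K)\bigr) \le \sigma\sqrt{2m\log|\mathcal{K}'|} \le \sigma\sqrt{2m\log|\mathcal{K}|}.$$

Taking unconditional expectation gives

$$E\bigl(\max_{K \in \mathcal{K}'} W(K)\bigr) \le \sigma\sqrt{2}\,E\bigl(\sqrt{m\log|\mathcal{K}|}\bigr) \le \sigma\sqrt{2}\,\bigl(E(m)E(\log|\mathcal{K}|)\bigr)^{1/2}. \tag{4.15}$$

Note that

$$E(m) = 8n\sigma^2 + 4\sum_{i=1}^n (x_i\beta^*)^2
= 8n\sigma^2 + 4\sum_{i=1}^n\sum_{j,k=1}^p x_{ij}x_{ik}\beta^*_j\beta^*_k
\le 8n\sigma^2 + 4|\beta^*|_1^2\max_{1 \le j, k \le p}\Bigl|\sum_{i=1}^n x_{ij}x_{ik}\Bigr|
\le 8n\sigma^2 + 4L^2\max_{1 \le j \le p}\sum_{i=1}^n x_{ij}^2
\le 8n\sigma^2 + 4L^2M^{1/2}n. \tag{4.16}$$

On the other hand,

$$\log|\mathcal{K}| = \log(N_1 + 1).$$

Therefore by (4.15),

$$E\bigl(\max_{K \in \mathcal{K}'} W(K)\bigr) \le (16\sigma^4 n + 8L^2M^{1/2}\sigma^2 n)^{1/2}\bigl(E(\log(N_1 + 1))\bigr)^{1/2}. \tag{4.17}$$

Now we are left to control the second term on the right hand side of (4.14). If we call it $S$ then

$$(E(S))^2 \le 4E\Bigl(\sum_{i \in I}\varepsilon_i^2\sum_{i \in I}y_i^2\Bigr)P\Bigl(4\sum_{i \in I}y_i^2 > m\Bigr). \tag{4.18}$$

Let

$$\xi_i := \begin{cases} 1 & \text{if } i \in I, \\ 0 & \text{if } i \in I^c. \end{cases}$$

Then

$$E\Bigl(\sum_{i \in I}\varepsilon_i^2\sum_{i \in I}y_i^2\Bigr) = E\Bigl(\sum_{i,j=1}^n \xi_i\xi_j\varepsilon_i^2y_j^2\Bigr)
= \frac{1}{2}\sum_{i=1}^n E(\varepsilon_i^2y_i^2) + \frac{1}{4}\sum_{i \ne j}E(\varepsilon_i^2)E(y_j^2)$$
$$= \frac{1}{2}\sum_{i=1}^n\bigl(3\sigma^4 + \sigma^2(x_i\beta^*)^2\bigr) + \frac{1}{4}\sum_{i \ne j}\sigma^2\bigl(\sigma^2 + (x_j\beta^*)^2\bigr)
= \frac{n(n+5)\sigma^4}{4} + \frac{(n+1)\sigma^2}{4}\sum_{i=1}^n (x_i\beta^*)^2. \tag{4.19}$$

Next, note that by Chebychev’s inequality,

$$P'\Bigl(4\sum_{i \in I}y_i^2 > m\Bigr) \le e^{-m/16\sigma^2}E'\Bigl(\exp\Bigl(\sum_{i \in I}\frac{y_i^2}{4\sigma^2}\Bigr)\Bigr) = e^{-m/16\sigma^2}\prod_{i \in I}E'\bigl(e^{y_i^2/4\sigma^2}\bigr). \tag{4.20}$$

By Lemma A.3 from the Appendix,

$$E'\bigl(e^{y_i^2/4\sigma^2}\bigr) = \sqrt{2}\,e^{(x_i\beta^*)^2/2\sigma^2}.$$

Thus,

$$e^{-m/16\sigma^2}\prod_{i \in I}E'\bigl(e^{y_i^2/4\sigma^2}\bigr) = 2^{|I|/2}e^{-|I|} \le 2^{-|I|/2}.$$

Therefore by (4.20),

$$P\Bigl(4\sum_{i \in I}y_i^2 > m\Bigr) \le E\bigl(2^{-|I|/2}\bigr) = \prod_{i=1}^n E\bigl(2^{-\xi_i/2}\bigr) = \Bigl(\frac{1 + 2^{-1/2}}{2}\Bigr)^n. \tag{4.21}$$

Combining (4.18), (4.19) and (4.21), and using the bound $\sum_{i=1}^n (x_i\beta^*)^2 \le L^2M^{1/2}n$ from (4.16), we get

$$(E(S))^2 \le \bigl(n(n+5)\sigma^4 + n(n+1)\sigma^2L^2M^{1/2}\bigr)\Bigl(\frac{1 + 2^{-1/2}}{2}\Bigr)^n. \tag{4.22}$$

The proof is completed by combining (4.14), (4.17) and (4.22). □

We are now ready to complete the proof of Theorem 2.1.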


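Proof of Theorem 2.1. Combine inequality (4.1) with Lemma 4.3 and Lemma 4.4 to get a bound
for $E(\sum_{i \in I}(\hat\mu'_i - \mu^*_i)^2;\ N_1\delta \ge |\beta^*|_1)$, and add a similar term for $I^c$. This shows that

$$E\bigl(\|\hat\mu' - \mu^*\|^2;\ N_1\delta \ge |\beta^*|_1,\ N_2\delta \ge |\beta^*|_1\bigr)
\le 6L^2(2Mn\log(2p^2))^{1/2} + 4L\sigma(2M^{1/2}n\log(2p))^{1/2}$$
$$+ (16\sigma^4 n + 8L^2M^{1/2}\sigma^2 n)^{1/2}\bigl(\sqrt{l_1} + \sqrt{l_2}\bigr)
+ 2\bigl(n(n+5)\sigma^4 + n(n+1)\sigma^2L^2M^{1/2}\bigr)^{1/2}\Bigl(\frac{1 + 2^{-1/2}}{2}\Bigr)^{n/2}. \tag{4.23}$$

Now note that by the definition of $\hat K$,

$$\|\mu^* - X\hat\beta^{(\hat K)}\| \le \|\mu^* - \hat\mu'\| + \|\hat\mu' - X\hat\beta^{(\hat K)}\|
\le \|\mu^* - \hat\mu'\| + \|\hat\mu' - X\hat\beta^{(L)}\|
\le 2\|\mu^* - \hat\mu'\| + \|\mu^* - X\hat\beta^{(L)}\|.$$

Thus,

$$\|\mu^* - X\hat\beta^{(\hat K)}\|^2 \le 8\|\hat\mu' - \mu^*\|^2 + 2\|\mu^* - X\hat\beta^{(L)}\|^2. \tag{4.24}$$

Next observe that by the definition of $\hat\beta^{(L)}$ and the fact that $|\beta^*|_1 \le L$,

$$\|Y - X\hat\beta^{(L)}\|^2 \le \|Y - X\beta^*\|^2 = \|\varepsilon\|^2.$$

Since $|\beta^*|_1$ and $|\hat\beta^{(L)}|_1$ are both bounded by $L$, the above inequality implies that

$$\|\mu^* - X\hat\beta^{(L)}\|^2 \le 2\sum_{i=1}^n \varepsilon_i(x_i\hat\beta^{(L)} - x_i\beta^*)
= 2\sum_{i=1}^n\sum_{j=1}^p \varepsilon_i x_{ij}(\hat\beta^{(L)}_j - \beta^*_j)
\le 4L\max_{1 \le j \le p}\Bigl|\sum_{i=1}^n \varepsilon_i x_{ij}\Bigr|. \tag{4.25}$$

By Lemma A.2,

$$E\Bigl(\max_{1 \le j \le p}\Bigl|\sum_{i=1}^n \varepsilon_i x_{ij}\Bigr|\Bigr)
\le \sigma\Bigl(2\log(2p)\max_{1 \le j \le p}\sum_{i=1}^n x_{ij}^2\Bigr)^{1/2}
\le \sigma(2nM^{1/2}\log(2p))^{1/2}. \tag{4.26}$$

Combining (4.23), (4.24), (4.25) and (4.26) completes the proof of Theorem 2.1. □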


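5. Proofs of Theorem 3.1 and Corollary 3.2

Proof of Theorem 3.1. Let $\hat\mu := X\hat\beta^{CV}$. Note that

$$|\hat\sigma^2 - \sigma^2| = \frac{\bigl|\|Y - \hat\mu\|^2 - n\sigma^2\bigr|}{n}
= \frac{\bigl|\|\varepsilon\|^2 + 2\varepsilon \cdot (\mu^* - \hat\mu) + \|\mu^* - \hat\mu\|^2 - n\sigma^2\bigr|}{n}$$
$$\le \frac{\bigl|\|\varepsilon\|^2 - n\sigma^2\bigr|}{n} + \frac{2\|\varepsilon\|\|\mu^* - \hat\mu\|}{n} + \frac{\|\mu^* - \hat\mu\|^2}{n}.$$

Let $A$ be the event that $N_1\delta$ and $N_2\delta$ both exceed $|\beta^*|_1$. Since $E(\varepsilon_i^2) = \sigma^2$ and $\mathrm{Var}(\varepsilon_i^2) = 2\sigma^4$, and
the $\varepsilon_i$’s are independent, therefore

$$E\Bigl(\frac{\bigl|\|\varepsilon\|^2 - n\sigma^2\bigr|}{n};\ A\Bigr) \le E\Bigl(\frac{\bigl|\|\varepsilon\|^2 - n\sigma^2\bigr|}{n}\Bigr)
\le \Bigl(\mathrm{Var}\Bigl(\frac{\|\varepsilon\|^2}{n}\Bigr)\Bigr)^{1/2} = \sigma^2\sqrt{\frac{2}{n}}.$$

Next, note that by the Cauchy-Schwarz inequality,

$$E\Bigl(\frac{2\|\varepsilon\|\|\mu^* - \hat\mu\|}{n};\ A\Bigr) \le \frac{2}{n}\bigl(E(\|\varepsilon\|^2)\,E(\|\mu^* - \hat\mu\|^2;\ A)\bigr)^{1/2}.$$

Since $E(\|\varepsilon\|^2) = n\sigma^2$, an application of Theorem 2.1, which gives $E(\|\mu^* - \hat\mu\|^2;\ A) \le nR$, completes the proof. □


Proof of Corollary 3.2. This is simply a question of verifying that the error bound tends to zero
as $n \to \infty$. By assumptions (i) and (ii), the quantities $C_1$ and $C_2$ remain bounded and $E_n$ tends
to zero, and the events $N_1\delta \ge |\beta^*|_1$ and $N_2\delta \ge |\beta^*|_1$ have probability tending to one as $n \to \infty$.
Finally by assumption (iii), $\sqrt{l_1/n}$, $\sqrt{l_2/n}$ and $\sqrt{\log(2p)/n}$ all tend to zero. To complete the proof,
let $A$ be the event that both $N_1\delta$ and $N_2\delta$ exceed $|\beta^*|_1$, and note that for any $s > 0$,

$$P(|\hat\sigma^2 - \sigma^2| > s) \le P(\{|\hat\sigma^2 - \sigma^2| > s\} \cap A) + P(A^c)
\le \frac{E(|\hat\sigma^2 - \sigma^2|;\ A)}{s} + P(A^c).$$

We have argued above that both terms in the last line tend to zero. This completes the proof. □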


Appendix 


This appendix contains a few simple lemmas that have been used several times in the proof of
Theorem 2.1. All of these are well-known results. We give proofs for the sake of completeness.

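Lemma A.1. Suppose that $\xi_1, \ldots, \xi_m$ are independent, mean zero random variables, and $\gamma_1, \ldots, \gamma_m$
are constants such that $|\xi_i| \le \gamma_i$ almost surely for each $i$. Then for each $\theta \in \mathbb{R}$,

$$E\bigl(e^{\theta\sum_{i=1}^m \xi_i}\bigr) \le e^{\theta^2\sum_{i=1}^m \gamma_i^2/2}.$$

Proof. By independence, it suffices to prove the lemma for $m = 1$. Also, without loss of generality,
we may take $\theta = 1$. Accordingly, let $\xi$ be a random variable and $\gamma$ be a constant such that $|\xi| \le \gamma$
almost surely. Let

$$\alpha := \frac{1}{2}\Bigl(1 - \frac{\xi}{\gamma}\Bigr).$$

Since $|\xi| \le \gamma$, therefore $\alpha \in [0, 1]$. Thus, by the convexity of the exponential map,

$$e^{\xi} = e^{-\alpha\gamma + (1 - \alpha)\gamma} \le \alpha e^{-\gamma} + (1 - \alpha)e^{\gamma}.$$

Taking expectation on both sides, and using the assumption that $E(\xi) = 0$, we get

$$E(e^{\xi}) \le \cosh\gamma.$$

It is easy to verify that $\cosh\gamma \le e^{\gamma^2/2}$ by power series expansion.

□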
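The last step of the proof compares the series $\cosh\gamma = \sum_k \gamma^{2k}/(2k)!$ with $e^{\gamma^2/2} = \sum_k \gamma^{2k}/(2^k k!)$ term by term, using $(2k)! \ge 2^k k!$. As a numerical sanity check only, the following Python sketch evaluates the gap $e^{\gamma^2/2} - \cosh\gamma$ on a grid; the function name `cosh_gap` and the grid are arbitrary choices made for this illustration.

```python
import math


def cosh_gap(gamma):
    """Return exp(gamma^2 / 2) - cosh(gamma), which the lemma's last step says is >= 0."""
    return math.exp(gamma * gamma / 2.0) - math.cosh(gamma)


# Evaluate the gap on a grid of values in [-5, 5] (an arbitrary illustrative range).
grid = [k / 10.0 for k in range(-50, 51)]
smallest_gap = min(cosh_gap(g) for g in grid)
```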


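Lemma A.2. Suppose that $\xi_1, \ldots, \xi_m$ are mean zero random variables, and $\sigma$ is a constant such
that $E(e^{\theta\xi_i}) \le e^{\theta^2\sigma^2/2}$ for each $\theta \in \mathbb{R}$. Then

$$E\bigl(\max_{1 \le i \le m}\xi_i\bigr) \le \sigma\sqrt{2\log m}$$

and

$$E\bigl(\max_{1 \le i \le m}|\xi_i|\bigr) \le \sigma\sqrt{2\log(2m)}.$$

Proof. Note that for any $\theta > 0$,

$$E\bigl(\max_{1 \le i \le m}\xi_i\bigr) = \frac{1}{\theta}E\bigl(\log e^{\theta\max_{1 \le i \le m}\xi_i}\bigr)
\le \frac{1}{\theta}E\Bigl(\log\sum_{i=1}^m e^{\theta\xi_i}\Bigr)
\le \frac{1}{\theta}\log\sum_{i=1}^m E\bigl(e^{\theta\xi_i}\bigr)
\le \frac{\log m}{\theta} + \frac{\theta\sigma^2}{2}.$$

The proof of the first inequality is completed by choosing $\theta = \sigma^{-1}\sqrt{2\log m}$. The second inequality
is proved by applying the first inequality to the collection $(\xi_1, \ldots, \xi_m, -\xi_1, \ldots, -\xi_m)$. □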
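The first bound can be checked by simulation in the simplest case covered by the lemma. The Python sketch below draws independent $N(0, \sigma^2)$ variables, which satisfy the moment generating function condition with equality, and compares the empirical mean of their maximum with $\sigma\sqrt{2\log m}$. Independence, the Gaussian choice, the seed, and the function names are all assumptions of this illustration; the lemma itself requires none of them.

```python
import math
import random


def mean_of_max(m, sigma, trials, seed=0):
    """Monte Carlo estimate of E(max of m independent N(0, sigma^2) draws)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, sigma) for _ in range(m))
    return total / trials


def maximal_bound(m, sigma):
    """The bound sigma * sqrt(2 log m) from the first inequality of the lemma."""
    return sigma * math.sqrt(2.0 * math.log(m))
```

For moderate $m$ the simulated mean sits visibly below the bound, which is consistent with the bound being an inequality that does not use independence.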

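Lemma A.3. If $Z \sim N(\mu, \sigma^2)$ then for any $a > 1$,

$$E\bigl(e^{Z^2/2a\sigma^2}\bigr) = \sqrt{\frac{a}{a - 1}}\,e^{\mu^2/2(a-1)\sigma^2}.$$

Proof. This is a simple Gaussian computation. Just note that since

$$\int_{-\infty}^{\infty} e^{-(x - \alpha)^2/2\beta^2}\,dx = \sqrt{2\pi\beta^2}$$

for any $\alpha$ and $\beta$, therefore

$$\int_{-\infty}^{\infty} e^{x^2/2a\sigma^2}e^{-(x - \mu)^2/2\sigma^2}\,dx
= \int_{-\infty}^{\infty} e^{(x^2 - a(x - \mu)^2)/2a\sigma^2}\,dx$$
$$= \int_{-\infty}^{\infty} e^{-(a-1)(x - \mu a/(a-1))^2/2a\sigma^2 + \mu^2/2(a-1)\sigma^2}\,dx
= e^{\mu^2/2(a-1)\sigma^2}\sqrt{\frac{2\pi a\sigma^2}{a - 1}}.$$

Now divide on both sides by $\sqrt{2\pi\sigma^2}$ to complete the proof.

□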
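The identity can also be verified numerically. The Python sketch below approximates the expectation by a trapezoid rule centered where the integrand peaks, and evaluates the closed form for comparison. The function names, the integration window of twelve standard deviations, and the number of steps are assumptions of this illustration rather than part of the lemma.

```python
import math


def lemma_a3_lhs(mu, sigma, a, steps=20000):
    """Trapezoid approximation of E exp(Z^2 / (2 a sigma^2)) for Z ~ N(mu, sigma^2), a > 1."""
    center = mu * a / (a - 1.0)               # where the integrand peaks
    spread = sigma * math.sqrt(a / (a - 1.0))  # its Gaussian width
    lo, hi = center - 12.0 * spread, center + 12.0 * spread
    h = (hi - lo) / steps

    def integrand(x):
        exponent = x * x / (2.0 * a * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)
        return math.exp(exponent) / math.sqrt(2.0 * math.pi * sigma ** 2)

    total = 0.5 * (integrand(lo) + integrand(hi))
    total += sum(integrand(lo + k * h) for k in range(1, steps))
    return total * h


def lemma_a3_rhs(mu, sigma, a):
    """Closed form sqrt(a / (a - 1)) * exp(mu^2 / (2 (a - 1) sigma^2))."""
    return math.sqrt(a / (a - 1.0)) * math.exp(mu ** 2 / (2.0 * (a - 1.0) * sigma ** 2))
```

With $a = 2$ this is the case used in the proof of Lemma 4.4, where the closed form reduces to $\sqrt{2}\,e^{\mu^2/2\sigma^2}$.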